This is a brief description of training process which has been used to get res10_300x300_ssd_iter_140000.caffemodel.
The model was created with SSD framework using ResNet-10 like architecture as a backbone. Channels count in ResNet-10 convolution layers was significantly dropped (2x- or 4x- fewer channels).
The model was trained in Caffe framework on some huge and available online dataset.

1. Prepare training tools
You need to use "ssd" branch from this repository https://github.com/weiliu89/caffe/tree/ssd . Checkout this branch and built it (see instructions in repo's README)

2. Prepare training data.
The data preparation pipeline can be represented as:

(a)Download original face detection dataset -> (b)Convert annotation to the PASCAL VOC format -> (c)Create LMDB database with images + annotations for training

a) Find some datasets with face bounding boxes annotation. For some reasons I can't provide links here, but you easily find them on your own. Also study the data. It may contain small or low quality faces which can spoil training process. Often there are special flags about object quality in annotation. Remove such faces from annotation (smaller when 16 along at least one side, or blurred, of highly-occluded, or something else).

b) The downloaded dataset will have some format of annotation. It may be one single file for all images, or separate file for each image or something else. But to train SSD in Caffe you need to convert annotation to PASCAL VOC format.
PASCAL VOC annotation consist of .xml file for each image. In this xml file all face bounding boxes should be listed as:

<annotation>
  <size>
    <width>300</width>
    <height>300</height>
  </size>
  <object>
    <name>face</name>
    <difficult>0</difficult>
    <bndbox>
      <xmin>100</xmin>
      <ymin>100</ymin>
      <xmax>200</xmax>
      <ymax>200</ymax>
    </bndbox>
  </object>
  <object>
    <name>face</name>
    <difficult>0</difficult>
    <bndbox>
      <xmin>0</xmin>
      <ymin>0</ymin>
      <xmax>100</xmax>
      <ymax>100</ymax>
    </bndbox>
  </object>
</annotation>

So, convert your dataset's annotation to the format above.
Also, you should create labelmap.prototxt file with the following content:
item {
  name: "none_of_the_above"
  label: 0
  display_name: "background"
}
item {
  name: "face"
  label: 1
  display_name: "face"
}

You need this file to establish correspondence between name of class and digital label of class.

For next step we also need file there all our image-annotation file names pairs are listed. This file should contain similar lines:
images_val/0.jpg annotations_val/0.jpg.xml

c) To create LMDB you need to use create_data.sh tool from caffe/data/VOC0712 Caffe's source code directory.
This script calls create_annoset.py inside, so check out what you need to pass as script's arguments

You need to prepare 2 LMDB databases: one for training images, one for validation images.

3. Train your detector
For training you need to have 3 files: train.prototxt, test.prototxt and solver.prototxt. You can find these files in the same directory as for this readme.
Also you need to edit train.prototxt and test.prototxt to replace paths for your LMDB databases to actual databases you've created in step 2.

Now all is done for launch training process.
Execute next lines in Terminal:
mkdir -p snapshot
mkdir -p log
/path_for_caffe_build_dir/tools/caffe train -solver="solver.prototxt" -gpu 0  2>&1 | tee -a log/log.log

And wait. It will take about 8 hours to finish the process.
After it you can use your .caffemodel from snapshot/ subdirectory in resnet_face_ssd_python.py sample.
